In this report we will see how we do the normalization, the batach effect correction of multiome data, for each methodology.
library(Signac)
library(Seurat)
library(ggplot2)
library(ggpubr)
library(tidyverse)
library(EnsDb.Hsapiens.v86)
library(plyr)
library(reshape2)
library(data.table)
library(GenomicRanges)
library(harmony)
library(hdf5r)
library(stringr)
library(ggpubr)
library(RColorBrewer)
library(magick)
library(knitr)
library(biovizBase)
library(patchwork)
set.seed(123)
# Paths
path_to_data <- here::here("results/R_objects/")
path_to_multiome_metadata <- here::here("data/tonsil_atlas_metadata_multiome.csv")
path_to_save <- here::here("results/R_objects/")
Extraction of gene annotations from EnsDb using hg38 as the reference assembly.
annotation <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotation) <- "UCSC"
genome(annotation) <- "hg38"
tonsil_merged <- readRDS(paste0(path_to_data,"1.tonsil_filtered_merged_all.rds"))
#become the first rowname as a column called "barcodes"
tonsil_merged@meta.data <- tibble::rownames_to_column(tonsil_merged@meta.data, "lib_name_barcode")
rn<-tonsil_merged@meta.data$lib_name_barcode
m1<-str_extract(rn, "BCLL\\_[\\d|\\d{2}]+\\_\\w\\_\\d")
tonsil_merged@meta.data$library_name<-m1
#Add rownames to metadata
rownames(tonsil_merged@meta.data)<-tonsil_merged@meta.data$lib_name_barcode
For normalization, Signac performs a two-step technique called TF-IDF (Term Frequency - Inverse Document Frequency). This technique normalizes the cells in order to correct possible differences in sequencing depth and normalize the peaks, increasing the signal of rare peaks. Here, we will use the RunTFIDF function.
Since scATAC-seq has a low signal/noise ratio is more difficult to select the top feature (peaks) as we do in scRNA-seq (genes). That is why we use the FindTopFeatures() command to select the top x% of the peaks for dimensionality reduction, or remove peaks that are present in less than x cells. In this case we will use all the min.cutoff=“q0” meaning that we will select the 100% of peaks.
Finally, for dimensionality reduction, we perform a mathematical technique called SVD or Singular Value Decomposition on the matrix returned by TD-IDF (SVD is also performed to generate PCAs) and using only the peaks selected in the previous step. We call LSI or latent semantic indexing the technique that combines a TF-IDF step followed by an SVD step, and it was successfully used for the first time in scATAC-seq analysis by Cusanovich et al., 2015.
We exclude the first dimension as this is typically correlated with sequencing depth Cells cluster completely separately in ATAC without harmony; so run harmony after SVD
RunSVD LSI
DefaultAssay(tonsil_merged) <- "ATAC"
tonsil_merged <- RunTFIDF(tonsil_merged)
## Performing TF-IDF normalization
## Warning in RunTFIDF.default(object = GetAssayData(object = object, slot =
## "counts"), : Some features contain 0 total counts
tonsil_merged <- FindTopFeatures(tonsil_merged, min.cutoff = "q0")
tonsil_merged <- RunSVD(tonsil_merged)
## Running SVD
## Scaling cell embeddings
Compute the correlation between total counts and each reduced dimension component.
LSI component is typically highly correlated with sequencing depth. The first LSI component often captures sequencing depth (technical variation) rather than biological variation. If this is the case, the component should be removed from downstream analysis. We can assess the correlation between each LSI component and sequencing depth using the DepthCor() function:
For scRNA-seq data we don’t typically observe such a strong relationship between the first PC and sequencing depth, and so usually retain the first PC in downstream analyses.
DepthCor(tonsil_merged)
Here we see there is a very strong correlation between the first LSI component and the totalnumber of counts for the cell, so we will perform downstream steps without this component.
tonsil_merged <- RunUMAP(
tonsil_merged,
dims = 2:40,
reduction = "lsi",
reduction.name = "umap.atac",
reduction.key = "atacUMAP_"
)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 13:49:31 UMAP embedding parameters a = 0.9922 b = 1.112
## 13:49:31 Read 11954 rows and found 39 numeric columns
## 13:49:31 Using Annoy for neighbor search, n_neighbors = 30
## 13:49:31 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 13:49:32 Writing NN index file to temp file /var/folders/kz/y10np0cj213fybqz181z9lghbzr42p/T//Rtmp4H1Byz/file16d17ea13452
## 13:49:32 Searching Annoy index using 1 thread, search_k = 3000
## 13:49:35 Annoy recall = 100%
## 13:49:36 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 13:49:38 Initializing from normalized Laplacian + noise (using irlba)
## 13:49:39 Commencing optimization for 200 epochs, with 491382 positive edges
## 13:49:46 Optimization finished
atac.umap<-DimPlot(
tonsil_merged,
reduction = "umap.atac",
group.by = "library_name",
pt.size = 0.1
) + ggtitle('scATAC UMAP') + NoLegend()
atac.umap
#split_by: library ,edad, genero
atac.umap
Our aim is to detect and exclude empty droplets or deth cells (lysed cells). Lysed cells have 3 hallmarks: - (1) low library size (total UMI), - (2) low library complexity (number of detected genes) a - (3) high fraction of mitochondrial expression (cytosolic mRNA leaks out of the cell).
DefaultAssay(tonsil_merged) <- "RNA"
tonsil_merged <- NormalizeData(
tonsil_merged,
normalization.method = "LogNormalize",
scale.factor = 1e4
)
tonsil_merged <- tonsil_merged %>%
FindVariableFeatures(nfeatures = 3000) %>%
ScaleData() %>%
RunPCA()
PCAPlot(tonsil_merged,
group.by = "library_name")
ElbowPlot(object = tonsil_merged)
tonsil_merged <- RunUMAP(
tonsil_merged,
dims = 1:30,
reduction = "pca",
reduction.name = "umap.rna",
reduction.key = "rnaUMAP_"
)
## 13:50:13 UMAP embedding parameters a = 0.9922 b = 1.112
## 13:50:13 Read 11954 rows and found 30 numeric columns
## 13:50:13 Using Annoy for neighbor search, n_neighbors = 30
## 13:50:13 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 13:50:14 Writing NN index file to temp file /var/folders/kz/y10np0cj213fybqz181z9lghbzr42p/T//Rtmp4H1Byz/file16d17699ec011
## 13:50:14 Searching Annoy index using 1 thread, search_k = 3000
## 13:50:17 Annoy recall = 100%
## 13:50:19 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 13:50:21 Initializing from normalized Laplacian + noise (using irlba)
## 13:50:21 Commencing optimization for 200 epochs, with 508910 positive edges
## 13:50:29 Optimization finished
rna.umap<-DimPlot(
tonsil_merged,
reduction = "umap.rna",
group.by = "library_name",
pt.size = 0.1) + NoLegend() + ggtitle('scRNA UMAP')
rna.umap
plot(rna.umap)
atac.umap + rna.umap
Pass the Seurat object to the RunHarmony function and specify which variable to integrate out. A Seurat object is returned with corrected Harmony coordinates.
DefaultAssay(tonsil_merged) <- "ATAC"
tonsil_merged <- RunHarmony(
object = tonsil_merged,
reduction = "lsi",
dims = 2:40,
group.by.vars = "library_name",
assay.use = "ATAC",
project.dim = FALSE,
reduction.save = "harmony_atac"
)
tonsil_merged <- RunUMAP(
tonsil_merged,
dims = 2:40,
reduction = "harmony_atac",
reduction.name = "umap.atac",
reduction.key = "atacUMAP_"
)
## 13:50:55 UMAP embedding parameters a = 0.9922 b = 1.112
## 13:50:55 Read 11954 rows and found 39 numeric columns
## 13:50:55 Using Annoy for neighbor search, n_neighbors = 30
## 13:50:55 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 13:50:56 Writing NN index file to temp file /var/folders/kz/y10np0cj213fybqz181z9lghbzr42p/T//Rtmp4H1Byz/file16d171b82a257
## 13:50:56 Searching Annoy index using 1 thread, search_k = 3000
## 13:51:00 Annoy recall = 100%
## 13:51:01 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 13:51:03 Initializing from normalized Laplacian + noise (using irlba)
## 13:51:04 Commencing optimization for 200 epochs, with 498624 positive edges
## 13:51:12 Optimization finished
Harm_peak<-DimPlot(
tonsil_merged,
reduction = "umap.atac",
group.by = "library_name",
pt.size = 0.1
) + NoLegend() + ggtitle('Peak Harmony')
harmony in RNA-seq
tonsil_merged <- RunUMAP(
tonsil_merged,
dims = 2:40,
reduction = "harmony_rna",
reduction.name = "umap.rna",
reduction.key = "rnaUMAP_"
)
## 13:51:28 UMAP embedding parameters a = 0.9922 b = 1.112
## 13:51:28 Read 11954 rows and found 39 numeric columns
## 13:51:28 Using Annoy for neighbor search, n_neighbors = 30
## 13:51:28 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 13:51:29 Writing NN index file to temp file /var/folders/kz/y10np0cj213fybqz181z9lghbzr42p/T//Rtmp4H1Byz/file16d1757c8836a
## 13:51:29 Searching Annoy index using 1 thread, search_k = 3000
## 13:51:32 Annoy recall = 100%
## 13:51:33 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 13:51:35 Initializing from normalized Laplacian + noise (using irlba)
## 13:51:36 Commencing optimization for 200 epochs, with 512340 positive edges
## 13:51:44 Optimization finished
Harm_rna<-DimPlot(
tonsil_merged,
reduction = "umap.rna",
group.by = "library_name",
pt.size = 0.1
) + NoLegend() + ggtitle('RNA Harmony')
Harm_peak+Harm_rna + plot_annotation(title = 'Harmony ATAC and RNA UMAP visualization')
FindModalNeighbors
This function will construct a weighted nearest neighbor (WNN) graph. For each cell, we identify the nearest neighbors based on a weighted combination of two modalities. Takes as input two dimensional reductions, one computed for each modality. Other parameters are listed for debugging, but can be left as default values.
# build a joint neighbor graph using both assays
tonsil_merged <- FindMultiModalNeighbors(
object = tonsil_merged,
reduction.list = list("harmony_atac", "harmony_rna"),
dims.list = list(2:40, 1:30),
modality.weight.name = "Joint_wnn_umap"
)
## Calculating cell-specific modality weights
## Finding 20 nearest neighbors for each modality.
## Calculating kernel bandwidths
## Warning in FindMultiModalNeighbors(object = tonsil_merged, reduction.list
## = list("harmony_atac", : The number of provided modality.weight.name is not
## equal to the number of modalities. ATAC.weight RNA.weight are used to store the
## modality weights
## Finding multimodal neighbors
## Constructing multimodal KNN graph
## Constructing multimodal SNN graph
Then we are going to build a joint UMAP visualization base on the wnn graph.
tonsil_merged <- RunUMAP(
object = tonsil_merged,
nn.name = "weighted.nn",
reduction.name = "wnn.umap",
reduction.key = "wnnUMAP_")
## 13:52:16 UMAP embedding parameters a = 0.9922 b = 1.112
## 13:52:17 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 20
## 13:52:19 Initializing from normalized Laplacian + noise (using irlba)
## 13:52:19 Commencing optimization for 200 epochs, with 361778 positive edges
## 13:52:28 Optimization finished
joint.umap<- DimPlot(tonsil_merged, label = FALSE, group.by = "library_name", pt.size = 0.1, reduction = "wnn.umap") + plot_annotation(title = 'Joint UMAP')+ ggtitle('Joint UMAP by library name') + NoLegend()
joint.umap
FindClusters
Identify clusters of cells by a shared nearest neighbor (SNN) modularity optimization based clustering algorithm. First calculate k-nearest neighbors and construct the SNN graph. More info here
#find cluster algorithm 3 = SLM algorithm
tonsil_merged <- FindClusters(tonsil_merged, resolution = 0.02,algorithm = 3, graph.name = "wsnn",verbose = FALSE)
print(colnames(tonsil_merged@meta.data))
## [1] "lib_name_barcode" "orig.ident" "nCount_RNA"
## [4] "nFeature_RNA" "nCount_ATAC" "nFeature_ATAC"
## [7] "nucleosome_signal" "nucleosome_percentile" "TSS.enrichment"
## [10] "TSS.percentile" "tss.level" "percent_mt"
## [13] "percent_ribo" "nCount_peaks" "nFeature_peaks"
## [16] "library_name" "ATAC.weight" "RNA.weight"
## [19] "wsnn_res.0.02" "seurat_clusters"
DimPlot(
tonsil_merged,
group.by = "wsnn_res.0.02",
reduction = "wnn.umap",
pt.size = 0.1, label = T
)+ ggtitle('Joint UMAP by resolution 0.02')
Also, we can see the UMAP for each methodology base on the same clustering.
DimPlot(
tonsil_merged,
group.by = "wsnn_res.0.02",
reduction = "umap.rna",
pt.size = 0.1, label = T
)+ ggtitle('scRNA UMAP by resolution 0.02')
DimPlot(
tonsil_merged,
group.by = "wsnn_res.0.02",
reduction = "umap.atac",
pt.size = 0.1, label = T
)+ ggtitle('scATAC UMAP by resolution 0.02')
saveRDS(tonsil_merged,paste0(path_to_save,"2.tonsil_merged_harmony.rds"))
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] patchwork_1.1.2 biovizBase_1.44.0
## [3] knitr_1.40 magick_2.7.3
## [5] RColorBrewer_1.1-3 hdf5r_1.3.7
## [7] harmony_0.1.0 Rcpp_1.0.9
## [9] data.table_1.14.4 reshape2_1.4.4
## [11] plyr_1.8.7 EnsDb.Hsapiens.v86_2.99.0
## [13] ensembldb_2.20.2 AnnotationFilter_1.20.0
## [15] GenomicFeatures_1.48.4 AnnotationDbi_1.60.0
## [17] Biobase_2.58.0 GenomicRanges_1.50.1
## [19] GenomeInfoDb_1.34.2 IRanges_2.32.0
## [21] S4Vectors_0.36.0 BiocGenerics_0.44.0
## [23] forcats_0.5.2 stringr_1.4.1
## [25] dplyr_1.0.10 purrr_0.3.5
## [27] readr_2.1.3 tidyr_1.2.1
## [29] tibble_3.1.8 tidyverse_1.3.2
## [31] ggpubr_0.4.0 ggplot2_3.3.6
## [33] sp_1.5-0 SeuratObject_4.1.2
## [35] Seurat_4.2.0 Signac_1.8.0
## [37] BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] rappdirs_0.3.3 rtracklayer_1.56.1
## [3] scattermore_0.8 bit64_4.0.5
## [5] irlba_2.3.5.1 DelayedArray_0.24.0
## [7] rpart_4.1.19 KEGGREST_1.38.0
## [9] RCurl_1.98-1.9 generics_0.1.3
## [11] cowplot_1.1.1 RSQLite_2.2.18
## [13] RANN_2.6.1 future_1.28.0
## [15] bit_4.0.4 tzdb_0.3.0
## [17] spatstat.data_3.0-0 xml2_1.3.3
## [19] lubridate_1.8.0 httpuv_1.6.6
## [21] SummarizedExperiment_1.28.0 assertthat_0.2.1
## [23] gargle_1.2.1 xfun_0.34
## [25] hms_1.1.2 jquerylib_0.1.4
## [27] evaluate_0.17 promises_1.2.0.1
## [29] fansi_1.0.3 restfulr_0.0.15
## [31] progress_1.2.2 dbplyr_2.2.1
## [33] readxl_1.4.1 igraph_1.3.5
## [35] DBI_1.1.3 htmlwidgets_1.5.4
## [37] spatstat.geom_2.4-0 googledrive_2.0.0
## [39] ellipsis_0.3.2 backports_1.4.1
## [41] bookdown_0.29 biomaRt_2.52.0
## [43] deldir_1.0-6 MatrixGenerics_1.10.0
## [45] vctrs_0.5.0 here_1.0.1
## [47] ROCR_1.0-11 abind_1.4-5
## [49] cachem_1.0.6 withr_2.5.0
## [51] BSgenome_1.64.0 progressr_0.11.0
## [53] checkmate_2.1.0 sctransform_0.3.5
## [55] GenomicAlignments_1.32.1 prettyunits_1.1.1
## [57] goftest_1.2-3 cluster_2.1.4
## [59] lazyeval_0.2.2 crayon_1.5.2
## [61] labeling_0.4.2 pkgconfig_2.0.3
## [63] nlme_3.1-160 ProtGenerics_1.28.0
## [65] nnet_7.3-18 rlang_1.0.6
## [67] globals_0.16.1 lifecycle_1.0.3
## [69] miniUI_0.1.1.1 filelock_1.0.2
## [71] BiocFileCache_2.6.0 modelr_0.1.9
## [73] dichromat_2.0-0.1 rprojroot_2.0.3
## [75] cellranger_1.1.0 polyclip_1.10-4
## [77] matrixStats_0.62.0 lmtest_0.9-40
## [79] Matrix_1.5-1 carData_3.0-5
## [81] zoo_1.8-11 reprex_2.0.2
## [83] base64enc_0.1-3 ggridges_0.5.4
## [85] googlesheets4_1.0.1 png_0.1-7
## [87] viridisLite_0.4.1 rjson_0.2.21
## [89] bitops_1.0-7 KernSmooth_2.23-20
## [91] Biostrings_2.66.0 blob_1.2.3
## [93] parallelly_1.32.1 spatstat.random_2.2-0
## [95] jpeg_0.1-9 rstatix_0.7.0
## [97] ggsignif_0.6.4 scales_1.2.1
## [99] memoise_2.0.1 magrittr_2.0.3
## [101] ica_1.0-3 zlibbioc_1.44.0
## [103] compiler_4.2.1 BiocIO_1.6.0
## [105] fitdistrplus_1.1-8 Rsamtools_2.12.0
## [107] cli_3.4.1 XVector_0.38.0
## [109] listenv_0.8.0 pbapply_1.5-0
## [111] htmlTable_2.4.1 Formula_1.2-4
## [113] MASS_7.3-58.1 mgcv_1.8-41
## [115] tidyselect_1.2.0 stringi_1.7.8
## [117] highr_0.9 yaml_2.3.6
## [119] latticeExtra_0.6-30 ggrepel_0.9.1
## [121] grid_4.2.1 VariantAnnotation_1.42.1
## [123] sass_0.4.2 fastmatch_1.1-3
## [125] tools_4.2.1 future.apply_1.9.1
## [127] parallel_4.2.1 rstudioapi_0.14
## [129] foreign_0.8-83 gridExtra_2.3
## [131] farver_2.1.1 Rtsne_0.16
## [133] digest_0.6.30 BiocManager_1.30.19
## [135] rgeos_0.5-9 shiny_1.7.3
## [137] car_3.1-1 broom_1.0.1
## [139] later_1.3.0 RcppAnnoy_0.0.19
## [141] httr_1.4.4 colorspace_2.0-3
## [143] rvest_1.0.3 XML_3.99-0.11
## [145] fs_1.5.2 tensor_1.5
## [147] reticulate_1.26 splines_4.2.1
## [149] uwot_0.1.14 RcppRoll_0.3.0
## [151] spatstat.utils_3.0-1 plotly_4.10.0
## [153] xtable_1.8-4 jsonlite_1.8.3
## [155] R6_2.5.1 Hmisc_4.7-1
## [157] pillar_1.8.1 htmltools_0.5.3
## [159] mime_0.12 glue_1.6.2
## [161] fastmap_1.1.0 BiocParallel_1.30.4
## [163] codetools_0.2-18 utf8_1.2.2
## [165] lattice_0.20-45 bslib_0.4.1
## [167] spatstat.sparse_3.0-0 curl_4.3.3
## [169] leiden_0.4.3 interp_1.1-3
## [171] survival_3.4-0 rmarkdown_2.17
## [173] munsell_0.5.0 GenomeInfoDbData_1.2.9
## [175] haven_2.5.1 gtable_0.3.1
## [177] spatstat.core_2.4-4